Part I - California Housing¶

by Timothy Adebisi¶

Introduction¶

The data analyzed here is the California housing price dataset, downloaded from Kaggle. The link to the dataset is available here.

Preliminary Wrangling¶

In [1]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import plotly.express as px

%matplotlib inline
In [2]:
# Load the dataset
df = pd.read_csv('housing.csv')
df.head()
Out[2]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 NEAR BAY
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 NEAR BAY
In [3]:
# Check the shape of the dataset
df.shape
Out[3]:
(20640, 10)
In [4]:
# Check the data type
df.dtypes
Out[4]:
longitude             float64
latitude              float64
housing_median_age    float64
total_rooms           float64
total_bedrooms        float64
population            float64
households            float64
median_income         float64
median_house_value    float64
ocean_proximity        object
dtype: object
In [5]:
# Check for null values
df.isna().sum()
Out[5]:
longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64
In [6]:
# Drop null values
df.dropna(axis=0, inplace=True)
In [7]:
df.isna().sum()
Out[7]:
longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
median_house_value    0
ocean_proximity       0
dtype: int64
In [8]:
df.housing_median_age.unique()
Out[8]:
array([41., 21., 52., 42., 50., 40., 49., 48., 51., 43.,  2., 46., 26.,
       20., 17., 36., 19., 23., 38., 35., 10., 16., 27., 39., 31., 29.,
       22., 37., 28., 34., 32., 47., 44., 30., 18., 45., 33., 24., 15.,
       14., 13., 25.,  5., 12.,  6.,  8.,  9.,  7.,  3.,  4., 11.,  1.])
In [9]:
# Change the datatype of some features from `float` to `int`
obs = ['housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households']

for v in obs:
    df[v] = df[v].astype('int')
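
Casting floats to integers silently truncates any fractional part, so it can be worth confirming that the columns really hold whole numbers first. The sketch below is a minimal illustration on a hypothetical toy frame (`toy` and `whole_number_columns` are names made up for this example, not part of the notebook). It also uses an explicit `'int64'`, on the assumption that a fixed width is preferable to the platform-dependent result of `'int'`.

```python
import pandas as pd

# Hypothetical toy frame standing in for a few housing columns
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0],
    "households": [126.0, 1138.0, 177.0],
    "median_income": [8.3252, 8.3014, 7.2574],
})


def whole_number_columns(frame, columns):
    """Return the columns whose non-null values are all whole numbers."""
    return [c for c in columns if (frame[c].dropna() % 1 == 0).all()]


# Only cast the columns that lose nothing in the conversion
safe = whole_number_columns(toy, ["total_rooms", "households", "median_income"])
toy = toy.astype({c: "int64" for c in safe})

print(safe)
print(toy.dtypes)
```

Here `median_income` is left as a float because its values have fractional parts.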
In [10]:
# Test
df.dtypes
Out[10]:
longitude             float64
latitude              float64
housing_median_age      int32
total_rooms             int32
total_bedrooms          int32
population              int32
households              int32
median_income         float64
median_house_value    float64
ocean_proximity        object
dtype: object
In [11]:
# Statistical summary
df.describe().transpose()
Out[11]:
count mean std min 25% 50% 75% max
longitude 20433.0 -119.570689 2.003578 -124.3500 -121.8000 -118.4900 -118.010 -114.3100
latitude 20433.0 35.633221 2.136348 32.5400 33.9300 34.2600 37.720 41.9500
housing_median_age 20433.0 28.633094 12.591805 1.0000 18.0000 29.0000 37.000 52.0000
total_rooms 20433.0 2636.504233 2185.269567 2.0000 1450.0000 2127.0000 3143.000 39320.0000
total_bedrooms 20433.0 537.870553 421.385070 1.0000 296.0000 435.0000 647.000 6445.0000
population 20433.0 1424.946949 1133.208490 3.0000 787.0000 1166.0000 1722.000 35682.0000
households 20433.0 499.433465 382.299226 1.0000 280.0000 409.0000 604.000 6082.0000
median_income 20433.0 3.871162 1.899291 0.4999 2.5637 3.5365 4.744 15.0001
median_house_value 20433.0 206864.413155 115435.667099 14999.0000 119500.0000 179700.0000 264700.000 500001.0000

What is the structure of your dataset?¶

The dataset has 20,640 observations and 10 features. The total_bedrooms column had 207 missing values; the rows containing them were dropped, leaving 20,433 observations. All of the variables are numeric except ocean_proximity, which is a nominal categorical feature.
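
The row counts can be double-checked by comparing the number of rows before and after dropping nulls. The sketch below does this on a small hypothetical frame (`toy` is made up for illustration); in the notebook the same arithmetic gives 20,640 - 207 = 20,433, which matches the count reported by `describe()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two rows missing total_bedrooms
toy = pd.DataFrame({
    "total_rooms": [880, 7099, 1467, 1274],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
})

rows_before = len(toy)
# Rows that contain at least one missing value in any column
rows_with_missing = int(toy.isna().any(axis=1).sum())

# Non-inplace dropna keeps the original frame available for comparison
cleaned = toy.dropna(axis=0)
rows_after = len(cleaned)

print(rows_before, rows_with_missing, rows_after)
```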

What is/are the main feature(s) of interest in your dataset?¶

I am interested in how median_house_value varies with location.

What features in the dataset do you think will help support your investigation into your feature(s) of interest?¶

I expect proximity to the water, longitude and latitude, and the total number of rooms to have a strong effect on house prices.

Univariate Exploration¶

In [12]:
df.sample(5)
Out[12]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
13112 -121.35 38.40 11 2322 459 1373 424 3.1750 94400.0 INLAND
5419 -118.43 34.02 38 2172 437 830 368 3.9091 500001.0 <1H OCEAN
20085 -120.29 38.01 12 3014 560 1424 485 3.0729 105100.0 INLAND
19111 -122.64 38.23 49 2300 463 1061 429 4.0750 228800.0 <1H OCEAN
20244 -119.25 34.27 35 2532 407 1338 422 4.7727 219000.0 NEAR OCEAN

Number of Houses in each Proximity¶

In [45]:
base_color = sb.color_palette()[0]
sb.countplot(data=df, x='ocean_proximity', color=base_color)
plt.title('Ocean Proximity Bars');

The largest group of houses in the dataset is less than an hour from the ocean, followed by the inland houses.
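
The bar heights can also be expressed as shares of the total, which makes statements about the largest category easier to check. This is a minimal sketch on hypothetical labels (the `toy` series and its counts are made up, not taken from the housing data):

```python
import pandas as pd

# Hypothetical category labels; the counts are invented for illustration
toy = pd.Series(
    ["<1H OCEAN"] * 4 + ["INLAND"] * 3 + ["NEAR BAY"] * 2 + ["ISLAND"],
    name="ocean_proximity",
)

# normalize=True returns proportions instead of raw counts,
# sorted from the most to the least frequent category
shares = toy.value_counts(normalize=True)
print(shares)
```

On the real data the equivalent call would be `df['ocean_proximity'].value_counts(normalize=True)`.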

In [30]:
# Define a function to plot Histogram

def histogram(DataFrame, x_value, x_label, y_label, title, nbins):
    plt.hist(data=DataFrame, x=x_value, bins=nbins)
    plt.xlabel(x_label)
    plt.ylabel(y_label)
    plt.title(title);

Distribution of House Median Age.¶

In [31]:
# Histogram of House Median Age
histogram(df, 
          'housing_median_age', 
          'House Median Age', 
          'Frequency', 
          'House Median Age',
           10)

The distribution is slightly skewed to the left and has a peak a little above 30. Most of the houses have a median age above 10 years.

Distribution of Median House Value¶

In [33]:
# Histogram of House Value
histogram(df,
          'median_house_value',
          'House Value',
          'Frequency',
          'House Value',
          10)

House value is skewed to the right, with a peak just after 100,000. This is interesting: it suggests that most of the houses, the majority of which are over 10 years old, are relatively affordable. Let's look at the distribution of income next.

Distribution of Household Income¶

In [44]:
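# Histogram of Household Income
# Note: per the dataset description, median_income is measured in tens of thousands of USD
histogram(df, 'median_income', 'Income [Tens of Thousands USD]', 'Frequency', 'Income', 20)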

Income is also strongly skewed to the right.

Rooms Distribution¶

In [46]:
histogram(df, 'total_rooms', 'Rooms', 'Frequency', 'Rooms Distribution', 100)

The room counts are skewed to the right, with most values concentrated between 0 and 5,000.

Bedroom Distribution¶

In [47]:
histogram(df, 'total_bedrooms', 'Bedrooms', 'Frequency', 'Bedrooms Distribution', 100)

The bedroom counts are also skewed to the right.

Population Distribution¶

In [48]:
histogram(df, 'population', 'Population', 'Frequency', 'Population', 100)

The population is skewed to the right just like the rooms and bedrooms.

Household Distribution¶

In [50]:
histogram(df, 'households', 'Households', 'Frequency', 'Households', 100)

The households feature follows a similar pattern to rooms, bedrooms and population: it is skewed to the right and has similar characteristics.
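
Skew can be quantified rather than judged only by eye. The sketch below uses pandas' `skew()` on synthetic data (the `toy` frame and its two columns are hypothetical, generated only for illustration): positive values indicate a right-skewed distribution and values near zero indicate rough symmetry.

```python
import numpy as np
import pandas as pd

# Synthetic data: one right-skewed column and one roughly symmetric column
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "right_skewed": rng.lognormal(mean=0.0, sigma=1.0, size=5000),
    "symmetric": rng.normal(loc=0.0, scale=1.0, size=5000),
})

# Sample skewness per column: > 0 means a longer right tail
skewness = toy.skew()
print(skewness)

# For right-skewed data the mean is typically pulled above the median
print(toy["right_skewed"].mean(), toy["right_skewed"].median())
```

On the housing data the analogous call would be `df[n_vars].skew()`, assuming a list of numeric column names such as `n_vars`.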

Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?¶

Most of the houses are valued between 50,000 and 250,000, with a peak around 200,000.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?¶

Income and house value are both skewed to the right. Household income may be proportional to house value.

Bivariate Exploration¶

First, let's take a look at how the numeric features correlate.

In [19]:
df.columns
Out[19]:
Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'median_house_value', 'ocean_proximity'],
      dtype='object')
In [20]:
# Numeric variables for correlation
n_vars = ['housing_median_age', 'total_rooms', 'total_bedrooms', 
          'population', 'households', 'median_income', 'median_house_value']

# Correlation heatmap
plt.figure(figsize=(6,6))
sb.heatmap(df[n_vars].corr(), annot = True,
           cmap = 'vlag_r', center = 0);
In [40]:
# Define a function for scatter plot.
def scatter(DataFrame, x_value, x_label, y_value, y_label, title):
    sb.scatterplot(data=DataFrame, x=x_value, y=y_value)
    plt.xlabel(x_label)
    plt.ylabel(y_label)
    plt.title(title);

Income and House Value¶

In [42]:
# Scatter plot of house value and income
scatter(df, 'median_income', 'Income [Tens of Thousands USD]', 'median_house_value', 'House Value [USD]', 'Income vs House Value')

There is a strong positive correlation between income and house value, which supports our earlier assumption.
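
A scatter plot shows the shape of a relationship, while the Pearson coefficient puts a number on its strength. The sketch below is an illustration on synthetic data (the `toy` frame, its columns and the coefficients used to generate it are all hypothetical), contrasting one strongly related pair with one unrelated pair.

```python
import numpy as np
import pandas as pd

# Synthetic data: value depends on income plus noise, age is independent
rng = np.random.default_rng(1)
income = rng.uniform(1.0, 10.0, size=2000)
toy = pd.DataFrame({
    "income": income,
    "value": 40000 * income + rng.normal(0, 60000, size=2000),
    "age": rng.integers(1, 53, size=2000),
})

# Series.corr defaults to the Pearson correlation coefficient
r_income = toy["income"].corr(toy["value"])
r_age = toy["age"].corr(toy["value"])

print(round(r_income, 2), round(r_age, 2))
```

On the housing data, `df['median_income'].corr(df['median_house_value'])` would give the same figure that the heatmap annotates.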

Housing Age and House Value¶

In [43]:
# Scatter plot of house age and house value
scatter(df, 'housing_median_age', 'Housing Age [Years]', 
        'median_house_value', 'House Value [USD]', 'Housing Age vs House Value')

There is no clear correlation between the age of the houses and their prices, so house age does not appear to be a determining factor in price.

Ocean Proximity and House Value¶

In [27]:
sb.violinplot(data=df, x='ocean_proximity', y='median_house_value', color=base_color, inner='box')
plt.xticks(rotation=15);

The median house value on the island is relatively high compared to the other locations. Could that mean the houses on the island have more rooms, or are they simply priced higher? Let's look at house value relative to total rooms in each proximity category.
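
The medians that the violin plot's inner box marks can also be tabulated directly, alongside the median room count for each category. This is a minimal sketch on a hypothetical toy frame (all values are invented for illustration):

```python
import pandas as pd

# Hypothetical toy frame; values are invented, not from the dataset
toy = pd.DataFrame({
    "ocean_proximity": ["INLAND", "INLAND", "NEAR BAY", "NEAR BAY", "ISLAND"],
    "median_house_value": [90000.0, 110000.0, 250000.0, 350000.0, 400000.0],
    "total_rooms": [2000, 3000, 1500, 2500, 1000],
})

# Median house value and median room count per proximity category
summary = (
    toy.groupby("ocean_proximity")[["median_house_value", "total_rooms"]]
    .median()
)
print(summary)
```

Comparing the two columns side by side helps separate "more rooms" from "higher price for the same number of rooms".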

Ocean Proximity, House Value and Rooms¶

In [59]:
# Derive the house value per room (called "density" here)
df['density'] = df['median_house_value'] / df['total_rooms']

# Group by ocean proximity and find the average density
df1 = df.groupby('ocean_proximity')['density'].mean()
df1
Out[59]:
ocean_proximity
<1H OCEAN     160.283393
INLAND        117.289168
ISLAND        287.767762
NEAR BAY      216.399820
NEAR OCEAN    181.834438
Name: density, dtype: float64
In [62]:
# Bar chart of density by proximity
df1.plot.bar()
plt.xlabel('Ocean Proximity')
plt.ylabel('Density')
plt.title('Density vs Proximity');

The island has the highest average house value per room, so houses on the island are the most expensive relative to their size.
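
A group mean is easier to interpret when the group size is shown next to it, since an average over a handful of rows is less stable than one over thousands. The sketch below reports both on a hypothetical toy frame (values invented for illustration; `density` follows the value-per-room definition used above).

```python
import pandas as pd

# Hypothetical toy frame; values are invented, not from the dataset
toy = pd.DataFrame({
    "ocean_proximity": ["INLAND", "INLAND", "INLAND", "ISLAND"],
    "median_house_value": [100000.0, 120000.0, 140000.0, 300000.0],
    "total_rooms": [1000, 1000, 1000, 1000],
})

# Value per room, as in the notebook's density column
toy["density"] = toy["median_house_value"] / toy["total_rooms"]

# Report the mean together with the number of rows behind it
summary = toy.groupby("ocean_proximity")["density"].agg(["mean", "count"])
print(summary)
```

On the housing data the same `agg(['mean', 'count'])` call would show how many rows support each category's average.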

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?¶

The plots confirm that household income correlates strongly with house value: generally, people purchase what they can afford. The age of a house does not appear to be a determining factor in its value.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?¶

Most houses on the island are quite expensive, while prices for houses near the bay and near the ocean are more evenly spread. The age of the houses has no clear relationship with their value.

Multivariate Exploration¶

Housing, price and Location¶

In [24]:
fig = px.scatter_mapbox(df,
                        lat='latitude',
                        lon='longitude',
                        center={'lat':37.09, 'lon':-121},
                        height=600,
                        width=600,
                        color='median_house_value',
                        hover_data=['ocean_proximity'])
fig.update_layout(mapbox_style='open-street-map', title='Housing Price and Location')
fig.show()

The closer the houses are to the ocean or the Bay Area, the higher the prices.

In [63]:
g=sb.PairGrid(data=df, vars=n_vars)
g.map_diag(sb.histplot)
g.map_offdiag(plt.scatter);

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?¶

This confirms that houses close to the ocean are more expensive.

Were there any interesting or surprising interactions between features?¶

The previous deductions were strengthened.

Conclusions¶

The following conclusions were derived from the analysis:

  • The closer the houses are to the Bay Area or the ocean, the higher the prices.
  • The houses on the island are the most expensive.
  • The age of a house does not correlate with its price.
  • Household income correlates with the value of the house.